Social Graph Final project
An analysis and visualization for security people on Twitter
In this project we focus on security people. As the world of IT becomes more widespread and increasingly complex, many security vulnerabilities arise that could have an enormous impact on a company or, in the worst case, an entire country. A few guardians dedicate their lives to securing IT infrastructure so the rest of us can sleep peacefully.
We use data from Twitter to build a network of security people based on Twitter's friend concept. With the network in hand, we run community detection to find out whether these people fall into groups, and we build a word cloud for each community to understand what its members talk about. Finally, we estimate the sentiment of each community by analyzing the text written by the people who fall into it.
Twitter is the main source of our dataset; we downloaded it through the Twitter API combined with the Tweepy library.
After crawling, cleaning, and formatting, about 2 million rows of records remain, giving us information such as the names of the security people and their friend relationships, which we later use to build the network.
The raw dataset is over 130 MB in total and was seeded from 3,543 tweets. Those tweets are only the starting point: by extracting their authors and retrieving each author's friends, we obtain the main data for the network. As mentioned, we also download each person's biography and tweets for the text analysis.
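The seed-expansion step above can be sketched as follows. This is an illustrative snippet, not the crawling code itself: the tweet dicts and field names are placeholders standing in for the Tweepy output, and the actual friend retrieval happens through separate API calls.

```python
# Sketch of the seed-expansion step, assuming each raw tweet carries a
# "user" field with the author's screen name (illustrative schema).
sample_tweets = [
    {"id": 1, "user": {"screen_name": "alice"}},
    {"id": 2, "user": {"screen_name": "bob"}},
    {"id": 3, "user": {"screen_name": "alice"}},  # same author tweeting twice
]

# Collect the unique authors of the seed tweets; their friend lists are
# then fetched via the Twitter API to form the main dataset.
seed_authors = {tweet["user"]["screen_name"] for tweet in sample_tweets}
print(sorted(seed_authors))
```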
The security people themselves are the nodes in our network, and Twitter's friend relationships are the edges.
The resulting graph shows who is friends with whom.
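Assembling such a graph is straightforward with networkx. A minimal sketch, assuming we have extracted (person, friend) pairs from the downloaded friend lists; the pairs below are illustrative placeholders:

```python
import networkx as nx

# Illustrative (person, friend) pairs; in the real dataset these come
# from each seed author's Twitter friend list.
friend_pairs = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "carol"),
]

# An undirected graph: Twitter friendships could also be modelled as
# directed edges, but mutual-follow analysis works well undirected.
g = nx.Graph()
g.add_edges_from(friend_pairs)

print(g.number_of_nodes(), g.number_of_edges())
```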
# collapse-hide
fig = plt.figure(figsize=(20, 10))
# Faint edges and translucent nodes keep the dense graph readable
nx.draw_networkx_nodes(g, positions, node_size=node_sizes, alpha=0.4)
nx.draw_networkx_edges(g, positions, edge_color="black", alpha=0.05, width=0.5)
plt.title("Security People Network")
plt.axis('off')
fig.show()
# collapse-hide
print("Top ten nodes sorted by degree")
sorted(g.degree, key=lambda x: x[1], reverse=True)[:10]
# collapse-hide
print('Number of nodes', g.number_of_nodes())
print('Number of edges', g.number_of_edges())
We want to explore how many security people there are, find out whether they generally know each other, and see whether they can be split into communities. Within these communities, we can use some natural language processing to detect which community speaks most loudly and see whether there is a difference in sentiment.
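The post does not state which community-detection algorithm produced `security_communities`, so here is a toy sketch using networkx's greedy modularity maximisation as a stand-in, on a graph of two tight triangles joined by a single bridge:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two tightly-knit triangles connected by one bridge edge.
g = nx.Graph()
g.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),  # triangle 1
    ("x", "y"), ("y", "z"), ("x", "z"),  # triangle 2
    ("c", "x"),                          # bridge
])

# Greedy modularity maximisation groups each triangle into its own community.
communities = greedy_modularity_communities(g)
print([sorted(c) for c in communities])
```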
# collapse-hide
hist, bin_edges = np.histogram(list(len(com) for com in security_communities.values()))
center = ((bin_edges[:-1] + bin_edges[1:]) / 2).round()
fig = plt.figure(figsize=(20, 10))
plt.bar(center, hist)
plt.title("Security community sizes")
plt.ylabel("Count")
plt.xlabel("Community size")
plt.xticks(center)
fig.show()
top_5_largest_communites = sorted(security_communities.values(), key=len, reverse=True)[:5]

with open("bios.csv", newline="") as f:
    csv_reader = csv.DictReader(f)
    bio_by_name = {row["screen_name"]: row["bio"] for row in csv_reader}

bios_by_community = {
    i: [bio_by_name.get(name, "") for name in members]
    for i, members in enumerate(top_5_largest_communites)
}
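The word clouds below are generated from per-community TF-IDF weights (`tfidfs`). The exact computation is not shown in the post; a hand-rolled sketch of one common TF-IDF variant, using two illustrative community texts:

```python
import math
from collections import Counter

# Illustrative stand-ins for the concatenated bios of two communities.
community_texts = [
    "malware reverse engineering malware",
    "cloud security cloud infrastructure",
]

tokenised = [text.split() for text in community_texts]
n_docs = len(tokenised)
# Document frequency: in how many communities does each word occur?
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))

# One common TF-IDF formulation: term frequency times log inverse
# document frequency (libraries apply different smoothing variants).
tfidfs = []
for tokens in tokenised:
    counts = Counter(tokens)
    tfidfs.append({
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    })
```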
# collapse-hide
wordcloud = WordCloud(
    max_words=100,
    collocations=False,
)
fig, axs = plt.subplots(nrows=len(tfidfs), ncols=1, figsize=(20, 20))
for i, tfidf in enumerate(tfidfs):
    wordcloud.generate_from_frequencies(tfidf)
    axs[i].set_title(f"Community {i+1}")
    axs[i].imshow(wordcloud, interpolation="bilinear")
    axs[i].axis("off")
fig.show()
# collapse-hide
def compute_average_sentiment(tokens):
    """Return the average sentiment value of the tokens.

    Each token in tokens must be lowercase.
    """
    if not tokens:
        return 0.0
    # Average the happiness scores of the tokens found in the word list;
    # nan_to_num covers the case where none of the tokens match.
    return np.nan_to_num(
        words_of_happiness[words_of_happiness["word"].isin(tokens)]["happiness_average"].mean()
    )
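The `bag_of_words` helper used below is not shown in the post; a minimal sketch consistent with what `compute_average_sentiment` expects (a list of lowercase tokens):

```python
import re

def bag_of_words(text):
    """Lowercase the text and split it into word tokens.

    A simple regex tokenizer standing in for the helper used in the post.
    """
    return re.findall(r"[a-z']+", text.lower())

print(bag_of_words("Security people KEEP us safe!"))
```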
communities = {i: set(members) for i, members in enumerate(top_5_largest_communites)}
text_of_communities = collections.defaultdict(str)

with open("sentiment_tweets.csv", newline="") as f:
    csv_reader = csv.DictReader(f)
    for row in csv_reader:
        for i, members in communities.items():
            if row["screen_name"] in members:
                text_of_communities[i] += f" {row['tweets']}"

sentiment_of_communities = {
    k: compute_average_sentiment(bag_of_words(v)) for k, v in text_of_communities.items()
}
for com, sentiment in sentiment_of_communities.items():
    print(f"Community {com + 1} has a sentiment value of {sentiment}")